We consider large-scale studies in which thousands of significance tests are performed simultaneously. In some of these studies, the multiple testing procedure can be severely biased by latent confounding factors, such as batch effects and unmeasured covariates, that correlate with both the primary variable(s) of interest (e.g. treatment variable, phenotype) and the outcome. Over the past decade, many statistical methods have been proposed to adjust for the confounders in hypothesis testing. We unify these methods in the same framework, generalize them to include multiple primary variables and multiple nuisance variables, and analyze their statistical properties. In particular, we provide theoretical guarantees for RUV-4 and LEAPP, which correspond to two different identification conditions in the framework: the first requires a set of "negative controls" that are known a priori to follow the null distribution; the second requires the true non-nulls to be sparse. Two estimators based on RUV-4 and LEAPP are then applied to these two scenarios. We show that if the confounding factors are strong, the resulting estimators can be asymptotically as powerful as the oracle estimator that observes the latent confounding factors. For hypothesis testing, we show that the asymptotic z-tests based on these estimators control the type I error. Numerical experiments show that the false discovery rate is also controlled by the Benjamini-Hochberg procedure when the sample size is reasonably large.
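The bias described above is easy to reproduce in simulation. The following sketch (an illustration of the problem setup, not of the paper's RUV-4 or LEAPP estimators) generates many outcomes under a global null with a latent factor correlated with the primary variable, and compares naive per-outcome z-tests to the oracle that regresses out the observed confounder; all variable names and the specific model are illustrative assumptions.

```python
# Illustrative simulation: latent confounding inflates naive multiple tests,
# while the oracle (which observes the confounder) stays calibrated.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 2000                   # samples, hypotheses (all null here)
x = rng.normal(size=n)             # primary variable of interest
z = 0.8 * x + rng.normal(size=n)   # latent confounder, correlated with x
gamma = rng.normal(size=p)         # confounder loadings on each outcome
Y = np.outer(z, gamma) + rng.normal(size=(n, p))  # beta = 0: global null

def zscores(design, Y):
    """t-statistic of the first design column, for each outcome column of Y."""
    Q, R = np.linalg.qr(design)
    coef = np.linalg.solve(R, Q.T @ Y)            # OLS coefficients
    resid = Y - design @ coef
    df = design.shape[0] - design.shape[1]
    sigma = np.sqrt((resid ** 2).sum(axis=0) / df)
    se = sigma * np.sqrt(np.linalg.inv(design.T @ design)[0, 0])
    return coef[0] / se

ones = np.ones(n)
naive = zscores(np.column_stack([x, ones]), Y)        # confounder ignored
oracle = zscores(np.column_stack([x, z, ones]), Y)    # confounder observed

# Under the global null, about 5% of |z| should exceed 1.96.
print("naive rejection rate: ", np.mean(np.abs(naive) > 1.96))
print("oracle rejection rate:", np.mean(np.abs(oracle) > 1.96))
```

The naive rejection rate far exceeds the nominal 5% level because the fitted effect of `x` absorbs part of the confounder's signal, while the oracle rate stays near 5%; the methods analyzed in the paper aim to recover the oracle's behavior without observing `z`.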